Uncertainty
Bayesian Multiplicity Correction in the Probabilistic Forward Stepwise Framework
Womack, Andrew, Taylor-Rodriguez, Daniel
We develop a natural Bayesian multiplicity-correcting prior distribution within the probabilistic forward stepwise representation of model space priors for regression problems. The proposed prior, obtained from making an analogy to the Holm procedure, exhibits behavior closely aligned with that of the Matryoshka doll prior. We compare both priors to several other priors, including some recently put forward as objective choices for model space prior probabilities. Our comparisons indicate that adequate multiplicity correction requires a degree of sparsity that many recommended priors do not provide, and we argue that multiplicity correction itself offers a principled and transparent criterion for specifying model space priors in regression.
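For readers unfamiliar with the Holm procedure that the abstract draws its analogy from, the following is a minimal sketch of the classical frequentist step-down test. It is not the authors' prior; the function name `holm_reject` and the example p-values are illustrative assumptions.

```python
def holm_reject(p_values, alpha=0.05):
    """Classical Holm step-down procedure.

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # with alpha / (m - k + 1) = alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


# Illustrative p-values (made up for this sketch).
decisions = holm_reject([0.01, 0.04, 0.03, 0.005], alpha=0.05)
print(decisions)
```

The threshold loosens as hypotheses are rejected, which is the step-down behaviour that a sequential (forward stepwise) prior can mimic.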
The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction
Wan, Shu, Gorantla, Abhinav, Liu, Huan, Candan, K. Selçuk
Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.
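To make the object concrete: in a DAG satisfying the standard assumptions, the Markov boundary of a target consists of its parents, its children, and the other parents of its children. The sketch below computes it from a known graph; the graph, the node names, and the function `markov_boundary` are hypothetical and not taken from SCM3K.

```python
def markov_boundary(parents, target):
    """Markov boundary of `target` in a DAG.

    `parents` maps every node to the set of its direct parents.
    The boundary is parents + children + other parents of children.
    """
    pa = set(parents.get(target, ()))
    children = {node for node, ps in parents.items() if target in ps}
    spouses = set()
    for child in children:
        spouses |= set(parents[child])
    return (pa | children | spouses) - {target}


# Hypothetical toy graph: N -> A -> T -> C <- S, and C -> D.
toy_parents = {
    "N": set(),
    "A": {"N"},
    "T": {"A"},
    "S": set(),
    "C": {"T", "S"},
    "D": {"C"},
}
print(sorted(markov_boundary(toy_parents, "T")))
```

Here `N` and `D` fall outside the boundary: once `A`, `C`, and `S` are observed, they carry no further information about `T`.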
On the Construction and Implications of Low-Loss Valleys in LoRA-based Bayesian Inference
Dold, Daniel, Sommer, Emanuel, Kobialka, Julius, Dürr, Oliver, Rügamer, David
While parameter-efficient fine-tuning methods like low-rank adaptation (LoRA) are standard for large language models, principled estimation of epistemic uncertainty remains challenging. Recent results in the LoRA regime suggest that discrete multi-mode approaches such as deep ensembles offer little benefit over single-mode methods. This contradicts broader observations in deep learning, where ensembling independent optima typically improves generalization, and linking these modes through continuous low-loss valleys further enhances Bayesian model averaging (BMA). Whether such structure exists in the LoRA space and whether it yields functional diversity missed by local or discrete methods has not been studied. We introduce LoRA-Curve, a segmented Bézier curve parameterization in the LoRA space, with two variants: a free configuration that jointly optimizes all control points, and an anchored configuration that connects independently fine-tuned LoRA optima. We prove pathwise continuity and Lipschitz regularity of the loss along the curve and empirically show, across reasoning and classification benchmarks with Qwen2.5 7B, that linear interpolation encounters loss barriers, while our anchored multi-segment curves connect independent optima through continuous low-loss valleys. Combined with flat-minima perturbations and a Jensen-Shannon divergence regularizer, LoRA-Curve yields measurably higher mutual information of the predictive distribution without sacrificing performance, and links continuous parameter-space traversal to functional diversity.
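The contrast between linear interpolation and a curved path can be illustrated on a toy problem. The sketch below uses a single quadratic Bézier segment and a made-up two-dimensional loss whose minima lie on the unit circle; it is not LoRA-Curve itself, and `toy_loss`, the endpoints, and the control point are assumptions chosen for illustration.

```python
def lerp(w0, w1, t):
    """Straight-line interpolation between two parameter vectors."""
    return [(1 - t) * a + t * b for a, b in zip(w0, w1)]


def quadratic_bezier(w0, control, w1, t):
    """Quadratic Bezier curve: (1-t)^2 w0 + 2t(1-t) control + t^2 w1."""
    return [
        (1 - t) ** 2 * a + 2 * t * (1 - t) * c + t ** 2 * b
        for a, c, b in zip(w0, control, w1)
    ]


def toy_loss(w):
    """Hypothetical loss whose minima form the unit circle."""
    x, y = w
    return (x * x + y * y - 1.0) ** 2


w0, w1 = [-1.0, 0.0], [1.0, 0.0]   # two "optima" on the circle
control = [0.0, 2.0]               # hand-picked control point
grid = [i / 20 for i in range(21)]

linear_barrier = max(toy_loss(lerp(w0, w1, t)) for t in grid)
curve_barrier = max(toy_loss(quadratic_bezier(w0, control, w1, t)) for t in grid)
print(linear_barrier, curve_barrier)
```

The straight line passes through the high-loss origin, while the curved path stays close to the low-loss circle. In the anchored setting described above, the endpoints would be fixed optima and only the control points would be trained.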
Joint Model and Data Sparsification via the Marginal Likelihood
Timans, Alexander, Möllenhoff, Thomas, Naesseth, Christian A., Khan, Mohammad Emtiyaz, Nalisnick, Eric
Sparse recovery in linear systems underpins applications from signal processing to high-dimensional regression. Sparse Bayesian Learning, grounded in the principle of automatic relevance determination (ARD), offers a practical Bayesian mechanism for feature sparsity via marginal likelihood optimization. Yet, its reliance on a homoscedastic noise model renders it sensitive to data contaminations such as outliers or misspecified noise, harming model fit and predictions. Instead, we propose jointly learning individual feature and sample relevancies, enabling simultaneous model and data sparsification via a single Bayesian objective. This symmetric pruning of model and data offers a natural extension that preserves conjugacy, admits closed-form updates for standard optimization procedures, and aligns with perspectives from robust regression and influence functions. Empirical results across diverse regression tasks affirm that a joint ARD approach consistently yields both sparse and robust prediction models.
Wasserstein Contraction of Coordinate Ascent Variational Inference
Caprio, Rocco, Corenflos, Adrien, Power, Sam
Finding approximations to an intractable probability distribution $\pi$ of interest (usually known only up to a normalizing constant) is a key problem in scientific computing. Variational Inference stands out as a particularly attractive tool for this task, owing to its statistical and computational efficiency, and it has been the framework underlying many advances in computational statistics over the past half century (Parisi, 1980; Hinton and Van Camp, 1993; Jordan et al., 1999; Bishop and Nasrabadi, 2006). The central idea is to seek a tractable approximation to $\pi$ within a chosen family of tractable distributions $\mathcal{Q}$ by minimizing a divergence to $\pi$ over that 'variational' family. Often, it is convenient or well-motivated to work with the family of product (or tensor, or factorized) distributions $\mathcal{Q} = \mathcal{P}^{\otimes m}$, and define optimality through minimisation of the Kullback-Leibler (KL) divergence (also 'relative entropy'): $\min \{ \mathrm{KL}(\varrho \,\|\, \pi) : \varrho \in \mathcal{P}^{\otimes m} \}$. A key practical aspect of working with this particular loss function is that in solving the associated optimisation problem, one is only required to compute expectations under the tractable variational distribution $\varrho$, rather than under the intractable target distribution $\pi$. In Bayesian statistics, $\pi$ typically represents the joint posterior distribution of latent variables $z \in \mathcal{Z}$ and some parameters $\beta \in \mathcal{B}$ given observed data $y \in \mathcal{Y}$. In these cases, we often choose $m = 2$ and seek the best variational approximation $\mu(\mathrm{d}z) \otimes \nu(\mathrm{d}\beta)$ to $\pi$ to solve $\min \{ \mathrm{KL}(\mu \otimes \nu \,\|\, \pi) : \mu \in \mathcal{P}(\mathcal{Z}),\ \nu \in \mathcal{P}(\mathcal{B}) \}$. The coordinate ascent variational inference algorithm (CAVI, Bishop and Nasrabadi, 2006; Blei et al., 2017) solves this problem by iteratively minimizing the Kullback-Leibler divergence with respect to one element at a time: given a starting point $\nu_0$, it iterates $\mu_k := \operatorname{argmin}$
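The alternating updates described above have a closed form in the textbook case of a bivariate Gaussian target, which gives a minimal illustration of CAVI and of its contraction toward a fixed point. The sketch below assumes a target with known mean and precision matrix and tracks only the means of the two factors; it is not the paper's general setting, and `cavi_gaussian` and all numerical values are illustrative assumptions.

```python
def cavi_gaussian(mean, precision, nu_mean_init, sweeps=50):
    """Mean-field CAVI for a bivariate Gaussian target.

    `mean` is (m1, m2); `precision` is ((l11, l12), (l21, l22)).
    Each factor stays Gaussian with fixed variance 1/l11 or 1/l22,
    so only the two factor means need to be iterated.
    """
    m1, m2 = mean
    (l11, l12), (l21, l22) = precision
    nu_mean = nu_mean_init          # mean of the starting factor nu_0
    history = []
    for _ in range(sweeps):
        # Update mu with nu held fixed, then nu with the new mu.
        mu_mean = m1 - (l12 / l11) * (nu_mean - m2)
        nu_mean = m2 - (l21 / l22) * (mu_mean - m1)
        history.append((mu_mean, nu_mean))
    return history


# Illustrative target: mean (1, -2), precision [[2, 1], [1, 2]].
trace = cavi_gaussian((1.0, -2.0), ((2.0, 1.0), (1.0, 2.0)), nu_mean_init=5.0)
print(trace[0], trace[-1])
```

In this example the error in the second factor's mean shrinks by the factor l12 * l21 / (l11 * l22) = 0.25 per sweep, a simple instance of geometric convergence of the iterates.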
Iterative Causal Discovery: Per-Edge Impossibility Certificates, Tier-Aware Oracle Queries, and the $1+K$ Lower Bound
Causal-discovery algorithms return a directed graph, yet provide no principled means of distinguishing edge directions identified by the data from those assigned without an identifying assumption. Under the standard Markov and faithfulness conditions, the observational distribution identifies only a Markov equivalence class; orientations within that class are not determined by the joint distribution and cannot be recovered from additional samples alone, but require either a functional restriction or an intervention. We introduce a protocol for observational causal discovery on continuous data that attaches to each candidate edge a discrete impossibility certificate: a RESOLVED code records the identifiability theorem under which the direction was committed, while an IMPOSSIBLE code records the failure mode together with the specific question a domain expert must answer to resolve it. The bivariate cascade is extended with five gated identifiability tiers (LSNM, IGCI, Stein, MDL, and PEIT) that abstain when their precondition test rejects. Two oracle primitives, the meta-hub query and the node-children query, jointly establish an upper bound of $1+K$ expert interactions sufficient to recover any DAG, where $K$ denotes the number of non-leaf vertices. Under an ideal-oracle assumption, the bound is met exactly on the asia, sachs, child, and alarm benchmarks.
GenSBI: Generative Methods for Simulation-Based Inference in JAX
Flow and diffusion generative models have established themselves as widely adopted density estimators for simulation-based inference (SBI), extending naturally from neural posterior estimation to likelihood and joint density estimation. Their principled optimization objectives and freedom from architectural constraints have driven rapid adoption across the natural sciences. Yet the most widely used SBI libraries remain PyTorch-based, leaving researchers who develop their forward models and analysis pipelines in JAX without a native option. We present GenSBI, an open-source library that implements flow matching, score matching, and denoising diffusion entirely in JAX. The library offers three transformer-based architectures -- SimFormer, Flux1, and a novel Flux1Joint that extends gate-modulated transformer blocks to joint density estimation -- all interchangeable through a unified interface that decouples generative method, neural backbone, and inference mode. GenSBI provides an end-to-end workflow from training through posterior calibration (SBC, TARP, LC2ST) and supports custom architectures with domain-specific embedding networks.
Identifiable Bayesian Deep Generative Copulas with Unknown Layer Widths for Data with Arbitrary Marginal Distributions
Deep generative models offer powerful tools for multivariate data analysis, but their black-box architectures are often unidentified and difficult to interpret. We introduce the Deep Discrete Encoder (DDE) Copula, an identifiable and interpretable generative model for multivariate data with arbitrary marginal distributions. The model places a hierarchical directed network of binary latent variables inside a copula framework, enabling flexible dependence modeling for mixed discrete and continuous data. Estimation is based on rank likelihoods, which decouple marginal modeling from posterior inference on the DDE parameters and avoid specifying the marginal distributions. We establish conditions for identification of the DDE copula parameters, ensuring that layer-specific parameters provide meaningful summaries of multivariate dependence. We also prove quotient-space posterior consistency for continuous margins under the exact rank likelihood and treat the extended rank likelihood for tied or mixed margins as a generalized likelihood, with concentration under an additional contrast condition. For computation, we propose a stochastic expectation-maximization algorithm for maximum a posteriori estimation, together with initialization strategies that improve convergence. To learn network dimension adaptively, we extend Bayesian rank-selection priors to infer layer-specific widths. Simulations show strong finite-sample performance, and a personality-survey analysis reveals interpretable hierarchical latent structure in complex multivariate data.
Soft Specialists: $α$-Rényi Ensembles for Uncertainty-Aware LLM Post-Training
Cordero-Encinar, Paula, Tyukin, Georgy, Duncan, Andrew B.
Existing training approaches for large language models learn a single set of parameters from large volumes of data that are typically heterogeneous, conflicting, and often outright contradictory. As a result, the model is forced to compress conflicting goals and inherent uncertainties into a single, averaged pattern of behaviour. We propose an $α$-Rényi variational framework for learning distributions over post-training parameters, offering an uncertainty-aware alternative to deep ensemble approaches. The resulting variational objective interpolates between classical variational Bayes and predictively oriented posterior learning, balancing globally plausible individual models against systems of complementary specialists. We identify local stability criteria, demonstrating how model misspecification can make non-degenerate posterior spread locally favourable, so that contradictory or conflicting data manifest as epistemic uncertainty. We apply our framework to LLM post-training, learning an ensemble of LoRA adapters attached to a shared, frozen base model, providing a scalable training procedure for both supervised fine-tuning and preference optimisation. Our approach enables training examples to be softly routed across ensemble members, promoting model specialisation and providing actionable uncertainty estimates across different tasks.
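As background for the interpolation described above, the $α$-Rényi divergence between two discrete distributions recovers the Kullback-Leibler divergence as $α$ approaches 1. The sketch below shows only this standard definition, not the paper's variational objective; the function names and the example distributions are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha (alpha > 0, alpha != 1).

    D_alpha(p || q) = log(sum_i p_i^alpha * q_i^(1 - alpha)) / (alpha - 1).
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1)


# Made-up example distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]
for alpha in (0.5, 0.999, 2.0):
    print(alpha, renyi_divergence(p, q, alpha))
print("KL", kl_divergence(p, q))
```

The divergence is non-decreasing in $α$, so the order acts as a dial between more and less tolerant notions of mismatch, with KL sitting at the limit $α \to 1$.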
Variance-Adaptive Optimal Algorithm for Reinforcement Learning with Multinomial Logit Function Approximation
Kim, Wonyoung, Oh, Min-Hwan, Iyengar, Garud, Zeevi, Assaf
Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-case analysis, they do not capture how performance depends on the variability of the interaction between the learner and the environment. In this paper, we develop a new theoretical analysis for MNL-based Markov decision processes that yields explicit variance-adaptive regret bounds. Our algorithm is computationally efficient and achieves the instance-wise optimal rate of regret, narrowing the gap between upper and lower bounds. Our numerical experiments validate that our method learns optimal policies more efficiently than conventional approaches.